ABSTRACT

The purpose of this paper, which is drawn from a
larger analytic history of the National Teacher Examinations (NTE)
program, is to investigate issues of validity within the context of 
the program's 50-year history. Three major findings emerge from 
historical considerations relating to: (1) the continuity of test 
content and justification over the 50-year period of the program's 
existence; (2) the primacy of reliance upon logical or content 
validity; and (3) the paradoxical relationship of the tests to 
teacher education curricula. First, since their inception the 
examinations have measured three categories of teacher 
knowledge — basic intellectual and communicative skills, general 
cultural and contemporary background, and pedagogical and 
professional information. When changes did occur, they were 
undertaken either for financial reasons or as responses to specific
criticisms. Second, there has been a strong tendency to justify the 
exams in terms of their logical or practical validity. Finally, 
despite the persistent assumption that the tests are needed because 
graduates of many teacher education programs are inadequately 
prepared, the source of test content and validity has been and 
continues to be focused primarily upon the perceived curricula of 
those programs. (PN) 
Historical Issues of Validity and Validation:
The National Teacher Examinations

Introduction 

In the past several years, concerns about school quality and teacher 
competence have focused public and professional attention on tests for teachers, 
most often on the National Teacher Examinations. This battery of standardized 
tests is currently used for teacher assessment and/or certification in some 
twenty-five states. It has also served as the model for the California Basic 
Educational Skills Tests and the Pre-Professional Skills Tests, exams used in
California, Texas, and elsewhere for admission into teacher education programs. 
These tests were originally developed in the 1930's and are currently prepared by 
the Educational Testing Service. Reviewers in the Buros publications have
periodically criticized the tests' lack of empirically documented validity, but for
the most part, neither the exams' content nor their validation has received much
critical attention. Only in the past decade have legal challenges to the tests' use
and documentation of their negative impact on minority teachers focused more than
passing attention on issues of validity and validation.

The purpose of this paper, which is drawn from a larger analytic history of the 
NTE program,1 is to investigate issues of validity within the context of the
program's fifty-year history. Drawing upon the literature of the sociology of school
knowledge, sources of test validity [the relationships between a test and what it
purports or is designed to measure] and methods for test validation [the procedures
for documenting those relationships] are explored by relating the content and
construction of successive versions of the examinations2 to (1) justifications for
test use and assumptions about validity made by program administrators, (2) 
validation procedures and techniques recommended by test officials, and (3) major 
validity studies conducted by project and/or independent researchers. 

Historical Concepts of Test Validity 

Concepts of test validity have been evolving since the early part of the 
twentieth century. The earliest writings on the subject recognized two types of
validity, logical and experimental.3 The latter involved "comparison of the results
secured on [a] test . . . with those obtained from other measures of the same thing,"
and the former was based upon "the careful inspection and analysis of the test
itself."4 Logical or practical validity was that "built into the test" by careful
planning and construction, sometimes with the use of explicit and comprehensive
descriptive rationales.5 By the late 1930's, professional attention focused upon
"direct measurement as a means of attaining validity,"6 and experimentally
determined validity was emphasized. Now often seen as "the step-child of
testing,"7 contemporary logical or content-related validity is concerned with the
"degree to which the sample of items, tasks, or questions on a test are
representative of some defined universe or domain of content."8 Although practical
validation continues to be promoted by some test theorists,9 most measurement
experts have favored the collection of empirical data and the correlation of test
scores with criterion measures.10

The initial technical standards prepared by the American Psychological 
Association (APA) in the early 1950's11 recognized four distinct types of
validity: (1) content validity involving "the sampling of a specified universe of
content;" (2) concurrent validity involving "the relation of test scores to an
accepted contemporary criterion of performance;" (3) predictive validity involving
"the relation of test scores to measures [taken] at some later time;" and
(4) construct validity involving "more indirect validation procedures . . . "12 In the
1966 revision of the standards, the predictive and concurrent categories were 
merged and treated as alternative forms of criterion-related validity. 

In recent years, construct validity with its concern for "understanding the
underlying dimensions or attributes being measured," ^'^ has come to be seen as a 
unifying concept which subsumes all other types of validity. The most recent APA 
standards still differentiate between content-related, criterion-related, and 
construct-related "evidences" of validity, but they state that gathering 
construct-related evidence "begins with test development and continues until the 
pattern of empirical relations between test scores and other variables clearly 
indicates the meaning of test score."'* Thus in many ways, "all validation is one, 
and in a sense all is construct validation."'"'

Antecedents of the National Teacher Examinations

Although exams for teachers have been used in the United States since colonial
times, reliance upon them diminished with the development and expansion of
teacher training programs. In the late 1920's and early 1930's,18 nationwide
emphases on school efficiency and accountability fueled by thriving intelligence and 
achievement testing and accompanied by a concurrent teacher surplus gave teacher 
testing new momentum. Research bureaus affiliated with urban school districts or 
with colleges and universities often constructed local tests for teaching candidates 
along with those for school children,19 and several batteries of tests for teachers
were sold nationwide.20 Within this context, Pennsylvania launched a state-wide
educational study which, though not originally designed to test teaching candidates, 
led directly to the National Teacher Examinations. 

Beginning in 1925, the Pennsylvania study was funded by the Carnegie 
Corporation to evaluate the quality of and relationships between the state's
secondary and higher education systems.21 In charge of the project were William
Learned of the Carnegie Foundation staff and Ben Wood, a national authority on 
objective testing and the director of Collegiate Educational Research at Columbia 
University. In 1928, graduating seniors in Pennsylvania's high schools were given a 
massive battery of commercial intelligence and achievement tests. That same year, 
special twelve-hour exams developed by Wood were administered on a trial basis to 
the state's college seniors. After revision, these exams were given twice to those 
1928 high school graduates who went on to college in Pennsylvania— in 1930 and 
again in 1932. Selected groups of high school seniors were also tested. Containing 
matching, true-false, and multiple-choice items, the exams were designed to assess
intelligence, English, mathematics, and general culture. 

The testers assumed that these exams measured "significant aspects of liberal 
arts education" and that their validity was demonstrated both by the "scope, 
distribution, and character of the questions" and by "feasible external checks."22 
Scores showed gains over the two-year period for most students at most
institutions and correlated reasonably well with college grades.23

The major finding of the Pennsylvania study was great variability in tested 
knowledge, variability which was exhibited among individuals and among 
institutions as well as within departments in the same institutions. Neither 
college attendance, nor class placement, nor school grades necessarily corresponded 
to knowledge displayed on the tests. The findings and interpretations of the 
Pennsylvania study led eventually to the creation of the American Council on 
Education's Cooperative Testing Service and to the development of secondary school 
and college guidance and testing programs. 

Though not an original focus of the research, the results of the Pennsylvania 
study became widely used to decry the academic quality of teachers and teacher 
candidates.24 Prospective teachers had tested particularly poorly. Their average 
scores were among the lowest of the entire sample. Learned and Wood
devoted one chapter of their final report to an analysis of the teachers' achievement 
and concluded that "teaching attracts college students who vary widely in the 
fundamental quality of their abilities and who fall below a knowledge minimum in a
large proportion of cases."25 Although the authors stated that the eventual solution
would require better programs and higher standards in the preparatory institutions, 
they put considerable emphasis upon the continued use of exams. They recommended 
that, prior to employment, school authorities test prospective candidates in order 
to "secure the best possible teachers for the money they have to pay."26 

The Original National Teacher Examinations 

The Cooperative Test Service of the American Council on Education began 
operations in 1930— partially to prepare tests for the Pennsylvania study. Funded 
by a ten-year grant from John D. Rockefeller and directed by Ben Wood, the service
was expected to develop multiple comparable forms of academic high school and
college tests.27 Beginning in 1932, special editions of its exams were prepared and
sold for use in teacher selection. By the late 1930's, the Service provided new
versions yearly to some fifteen or twenty cities including Providence, Philadelphia,
Pittsburgh, and Cleveland.28 When the subsidizing grant expired, the
superintendents sought additional foundation support. Again, as in Pennsylvania,
the Carnegie Corporation provided the funds. 

In 1939, the American Council on Education established the National Teacher
Examinations program to assist school administrators with teacher selection. A 
committee composed primarily of urban school superintendents whose systems had 
used the earlier tests was selected by the Council and charged with responsibility 
for the program.29 Under the supervision of Ben Wood as project director, the tasks
of constructing, administering, and correcting the exams were assigned to the 
Cooperative Test Service. 

The initial tests were prepared following procedures originally used by Wood in 
Pennsylvania. Staff editors developed preliminary test outlines and tentative item 
specifications. General suggestions were gathered from the advisory committee and 
other administrators and supplemented with data gleaned in analyses of "courses of 
study, textbooks, journal articles, and reports of professional organizations."30
Outlines were sent to teacher education and school system personnel for review and 
criticism. Tentative items were tried out in several teacher training institutions. 

The original "common" exams,31 first administered in 1940, assessed that
knowledge selected by the administrators as representative of what "all good
teachers should know"32: basic intellectual and communicative skills, cultural and
contemporary background, and professional information. Modeled closely after
those developed for use in Pennsylvania, the exams were multiple choice in nature
and emphasized aspects of general and contemporary culture, rather than pedagogy
or professional knowledge. The 1940 exam took eight hours and was composed of
eleven separate tests. The titles and contributing portions to the "common
examination total score" were as follows:





Intellectual and Communicative Skills 30 percent

1. Reasoning 10%
2. English Comprehension 10%
3. English Expression 10%

Cultural and Contemporary Background 40 percent

4. Contemporary Affairs 10%

Test of General Culture:

5. Current Social Problems 5%
6. History and Social Studies 5%
7. Literature 5%
8. Fine Arts 5%
9. Science 5%
10. Mathematics 5%

11. Professional Information 30 percent

Education and Social Policy 7.5%
Child Development and Educational Psychology 7.5%
Guidance and Individual and Group Analysis 7.5%
Elementary and Secondary School Methods 7.5%



Announcements for the project emphasized varied standards in teacher 
preparation institutions and the complex nature of good teaching. The exams, it was 
stressed, would help select the best candidates from a surplus which varied widely 
in ability and training. It was also suggested that "the opportunity to 'register' 
talents on a national scale" would be advantageous to candidates and institutions
preparing teachers.33

Advertised nationally, test promotion was most successful in the urban areas 
of the New England and Middle Atlantic states where the practice of examining 
teaching candidates was already established and where a substantial surplus of 
teacher candidates existed. In a presentation to other urban administrators, 
Alexander Stoddard, Philadelphia's superintendent of schools and the chairman of 
the testing program's advisory committee, stressed the efficiency with which the 
exams had selected candidates for only "a few scattered appointments in the past
three years" from a waiting list of over 3000 "qualified" applicants.34 Program
announcements emphasized that participation in the national program would save
the time and expense of constructing, administering, and scoring local tests.35 The
exams were promoted as the most accurate and economical device known for
measuring "essential elements of teaching ability."36

Early Considerations of Validity 

From the beginning, program officials stressed that the exams did not measure 
the totality of teaching ability and therefore should not be judged by their 
correlation with "available criteria" of teaching ability. In an early talk to teacher 
educators, Ben Wood argued against "the naive error" of judging validity in terms of 
correlation with measures of teaching success. He likened the tests to physicians' 
thermometers and stethoscopes, valid instruments but not sufficient for a
"complete diagnosis."37

Early critics of the exams, many of whom were teacher educators,38 did,
however, raise questions about their validity. One saw the test makers'
disclaimers as admissions that "the really important things in teacher selection"
were not being measured.39 Another suggested that, rather than beginning with a
definition of "good teaching," the test makers had asked: "What test items of the
kind suggested by school superintendents can we devise which will yield answers
that are statistically reliable?"40 Not enough had been done, the critics maintained,
to ascertain if persons who could score well on the exams were those also 
recognized as good teachers. 

Test personnel continued to argue that the value of the exams could not be 
judged by correlating them with "that composite we think of as teaching success."41
Since teaching ability was a complex combination of numerous interacting factors,
it was not "reasonable to expect any one of the essential factors to correlate highly
with the total complex."42 In one much-quoted article, Wood suggested that the
tests should be judged, instead, by how accurately they measured those parts of
teaching they were "designed to measure, namely, intelligence (linguistic and
quantitative), general and specific cultures of the types judged desirable by the
teacher-selecting authorities, and professional information."43

Test personnel stressed that the tests were "constructed by subject matter 
experts and test technicians so as to insure maximum validity and reliability."44 In
1940, John Flanagan, associate director of the Cooperative Test Service, carried
out a preliminary empirical study45 which foreshadowed his later theoretical work
on comprehensive test rationales.46 Flanagan argued that an important type of
validity is related to the way a test is constructed. "A test is valid," he stated,
"when, according to experts, the sampling of content and mental processes in the
test is similar to that indicated in the outline and specifications for the test."47
This reliance on what was later called content validity— on careful construction and 
representative content— continued to be stressed in program materials for test 
users. 

Somewhat paradoxically, Flanagan also compared test scores to several 
commonly "available" measures of teaching ability— supervisors' and students' 
ratings. Using the test scores of experienced teachers who took the first exams in 
1940, he identified twenty-two school systems with employees whose scores
differed by at least 100 points. School superintendents were asked to secure both
supervisory and pupil ratings for the forty-nine teachers selected. The correlation
of test scores with supervisors' "overall judgment of the teachers' general
effectiveness and desirability" was .51.48 Correlations with other supervisory
ratings were reported as "around .50." Pupil data were not reported in terms of
correlation coefficients
but suggested a relationship between test scores and student perceptions of teacher 
characteristics. 

Over the next few years, other investigators attempted to assess validity
experimentally. Some compared test scores to supervisors' or principals' ratings.49
Since these "measures" were of such varied reliability, it is not surprising that
much of this work was criticized later by more psychometrically sophisticated
testing proponents.50 Later in the decade, investigations were broadened to include
comparisons with concurrent and predictive measures of achievement in college or
graduate school.51 Although not directly discouraging this kind of research, both
early and later program personnel tended to attribute the low to moderate
correlations yielded by these studies to their theoretical or technical inadequacies.

Financial Distress and Temporary Solutions: The NTE Program in the 1940's

With the Second World War came a severe reduction in the number of applicants
for teaching positions. The oversupply of teachers dissipated and so did the market 
for the examinations. Under the leadership of David Ryans,52 director of the 
Cooperative Test Service in the middle 1940's, the NTE program managed to survive 
by incorporating a number of cost-cutting procedures. 

The major response to the adverse financial situation was the abandonment of
almost all new test construction and the reuse of those exams already prepared. For 
the first three years of the program's operation, original and comparable forms of 
the entire exam had been constructed annually by the Cooperative Test Service. 
However, during 1943, 1944, and 1945 only the Contemporary Affairs section of the 
common battery was newly prepared each year.53 Refurbished versions of the exam
were provided by combining sections of the earlier three tests. 



In 1944, Ryans convinced the national advisory committee that this procedure 
could not continue indefinitely. The committee approved the reorganization of the 
general culture component and shortened the common exam enough to be 
administered in a single day, a move which resulted in lower administrative costs.54
To supplement his meager staff, Ryans found outside specialists, many of whom
were affiliated with the testing bureaus of midwestern colleges and
universities, willing to help prepare and review the exams.55 In an attempt to
mollify teacher educators, the weighting of the professional section of the NTE was
modified. In addition, each of the professional tests began to be reported separately
on the score profile "for guidance purposes."56 Thus, beginning in 1946, the titles
and contributing portions to the "common examination total score" were as follows: 



Intellectual and Communicative Skills 30 percent

1. Reasoning 10%
2. English Comprehension 10%
3. English Expression 10%

Cultural Background 30 percent

4. History, Literature, and Fine Arts 10%
5. Science and Mathematics 10%
6. Contemporary Affairs 10%

Professional Information 40 percent

7. Education and Social Policy 10%
8. Child Development and Educational Psychology 10%
9. Guidance and Individual and Group Analysis 10%
10. General Principles and Methods of Teaching 10%



In order to save money, the number of items and total testing time were 
further reduced each year until 1950, the final year that the project was affiliated 
with the American Council on Education. Even with these modifications, the exams 
used at the end of the decade were very similar to those originated in 1940.

During and following the war, additional efforts were made to secure support
and broader use within the teacher education community. Both the composition and
the leadership of the national advisory committee were changed to include more
representation by teacher training personnel. Reduced student fees were offered "to
acquaint colleges and students with the program,"57 and promotional materials
aimed at teacher education personnel were prepared.58 In spite of these moves,
however, test use by students remained very low, and it became clear that other
sources of income would be needed. Supplementary grants from the Carnegie
Corporation in 1940 and 1941 helped offset the war's immediate effects, but no
further foundation monies were provided.59

Beginning in 1944, additional revenue was secured from test sales to the State 
of South Carolina for use in a new teacher certification program. Over the next few 
years, these test administrations provided the major source of NTE funding.60 South
Carolina's new system relied on NTE scores to determine "grade" of certification
(and thus state salary reimbursement) for both experienced and new teachers and
replaced a dual system based upon race similar to one which had been outlawed by
the U.S. Supreme Court in 1940. A "validation" study conducted by teacher educators
at the University of South Carolina with the assistance of Ben Wood as consultant61
compared selected groups of white teachers and teacher candidates. It concluded
that "successful teachers in South Carolina are likely to make higher scores [on the
National Teacher Examinations] than prospective teachers who are seniors in the
colleges of the State."62 Subsequent state-wide administrations revealed that
white teachers tended to outscore blacks and eventually the system was challenged
in the courts. For years, however, NTE scores maintained a salary differential
previously based explicitly on race.63 Although alluded to in one program
publication,64 South Carolina's use of the tests for salary purposes was rarely
described in program materials. 

For the rest of the decade, NTE informational and promotional materials were 
prepared by David Ryans. He also wrote most of what was published about test 
validity, drawing on his and others' earlier work, and usually reiterating familiar 
arguments. His 1949 article for school administrators65 was drawn from "The
National Teacher Examinations: Notes on the Question of Their Validity," an
informational sheet he had prepared and provided to test users in 1946. It reported
on "two preliminary statistical studies." The first of these was the Flanagan study,
the other a comparison in one unnamed college of prospective teachers' scores with
faculty ratings of their "probable success." Never published except as a brief item
in Ryans's newsletter for potential exam users,66 this research apparently was an
attempt to provide validity data to justify test use in colleges and universities. 
Like his predecessors, Ryans argued that high correlations between the NTE and "the
usual criteria of teaching success" were unlikely because "no adequate criteria of
teaching success" yet existed and because the exams measured "only one phase of
teaching ability." They did, he believed, "provide reliable estimates of the
candidates' intellectual and cultural backgrounds."67 No mention was made of South
Carolina's study. 

Despite underfunded and inadequate test development for much of this period, 
Ryans continued to emphasize the tests' content validity and indicated that the 
major source of their validity lay in the way in which they were prepared. He 
commended the tests' "constant" revisions and their relationship to "materials that 
are believed to be important for teachers to know" and concluded that "from the 
standpoint of their representativeness of types of materials and objectives they are 
prepared to measure, there is little question of the validity of the Teacher 
Examinations."68

Transitions and Recovery: The NTE Program in the 1950's 

Late in 1947, in order to deal "with testing and measurement in a coordinated 
manner and [to eliminate] duplication of effort,"69 the American Council on
Education merged its testing programs with those of the College Entrance 
Examination Board and the Graduate Record Office to form a new organization— the 
Educational Testing Service. Between 1948 and 1951, project administration, test 
preparation, and eventually sponsorship of the National Teacher Examinations 
program were transferred to the new agency.

Guided by the overall leadership of Henry Chauncey, president of the 
Educational Testing Service, and by the specific project direction of Arthur Benson, 
strenuous efforts were undertaken to economize, to make the program 
self-supporting and more efficient. Administrative procedures were simplified and 
the exams "streamlined."''^ 

The version of the National Teacher Examinations administered by ETS in 1951
was the shortest and quickest test in the history of the program. The common
examination, which in 1940 had included 1217 items to be answered in eight hours 
of working time, was reduced to three hundred items and a working time of just 
over three hours. In content and structure, however, the test was remarkably 
similar to those administered earlier. In fact, many of the items on the cultural and 
professional sections were taken directly from earlier tests.^ The test of reading 
ability was eliminated. Contemporary content from a previously separate subtest 
was incorporated into the other general culture sections. A new "weighted common
examination total" or "WCET" score was created to be comparable to the earlier 
total score. Titles and contributing portions to the "WCET" became as follows: 



Intellectual and Communicative Skills 20 percent

1. Reasoning 10%
2. English Expression 10%

Cultural Background 40 percent

3. History, Literature, and Fine Arts 20%
4. Science and Mathematics 20%

5. Professional Information 40 percent

Education as a Social Institution 10%
Child Development and Educational Psychology 10%
Guidance and Measurement 10%
General Principles and Methods of Teaching 10%



Although the basic examinations changed little during the next decade, the 
program diversified with the development of specialized state- and 
institution-wide testing services, supplementary tests for administrators and 
others, and new subject-matter exams. Except for their shortening, however, the 
scope and the emphases of the common battery during the 1950's resembled those of 
the earlier exams. 

Most research conducted during this period was done by master's and doctoral
students and involved the assessment of teaching candidates trained at a particular 
college. Exam scores were correlated with undergraduate grade point average^^ or 
with achievement test scores.^^ Correlations were also computed between National 
Teacher Examinations scores and various assessments of teaching ability.^ Most of 
what was published about the exams during the 1950's was prepared by NTE project 
director, Arthur Benson, who was responsible for both informational and 
promotional materials. As in the late 1940's, the tests were recommended to 
teacher educators for "institutional evaluation, counseling and placement activities, 
and screening ... for graduate work,"^^ but test use was still justified primarily in
terms of widely varied teacher preparation. NTE results were said to be "a useful 
supplement to academic records since they [provided] school systems with 
comparable measures for all teacher applicants without regard to the standards of 
the institutions which prepared them."^ 

The Educational Testing Service conducted no original NTE validation research
during this decade but made references to content and concurrent validity in project
publications. A statement in the program's first specialized pamphlet for users, the 
Handbook for School and College Officials, published in 1959, stressed that "a 
priori evidence as to content validity ... is inherent in the manner in which the 
tests are planned and constructed."'^ Potential users were encouraged to "inspect
the tests to determine the relevance of the test materials to their [own] purposes." 
Although no specific references were cited, the handbook stated that "periodic 
reports of studies which have related NTE scores to such criteria as grade point 
averages or credit hours of collegiate study have been consistent in supporting the 
[tests'] concurrent validity."79

Predictive validation was presented as problematic. Benson repeatedly 
criticized "so-called validity studies"80 which attempted to measure the tests
"against on-the-job performance"81 and argued that "vaguely defined ratings by
supervisors or administrators" were no longer acceptable "as adequate criteria of
teacher effectiveness."82 A further statement about "on-the-job" criteria appeared
first as a footnote in the 1951 publication for users83 and then as part of the text in
the 1964 version.84 It read: "The validity of the NTE is more appropriately judged
on the basis of proximate criteria than on ultimate success in teaching. Until 
research establishes universally acceptable criteria of teaching effectiveness, 
results of validating the NTE against on-the-job performance of teachers are likely 
to be inconclusive. ..." In 1967, the statement was modified to blame the lack of
predictive criteria on "professional educators:" "At present, professional educators
are unable to agree on the meaning of 'teaching effectiveness.' Until educators are
able to define and divide this criterion into components which can be validly and
reliably measured, this method of substantiating or refuting the validity of the NTE
will remain relatively unsuccessful."85

Professionalizing the Examinations: The NTE's in the 1960's 

Like teacher education a decade earlier, the National Teacher Examinations 
became the focus of growing critical attention in the 1960's. The ascent of Sputnik, 
the poor showing of teachers on the Selective Service Qualifying Tests, concern 
about the alleged dominance of teacher preparation by "educationists,"''^ and an 
intense debate over the relationship between general and professional components 
of teacher education all affected attitudes toward teacher preparation and toward 
teacher tests. In 1961, an external review committee, nominated by the National 
Education Association's National Commission on Teacher Education and Professional 
Standards, recommended extensive alterations in the organization and the content 
of the exams. Although advocating test use "as an aid in teacher selection," the 
review committee recommended the establishment of "new norms based on a 
nationwide sampling of all prospective teachers"°^ and discouraged "the use of 
scores for other purposes, such as certification" until revisions were made. It also 
called for periodic program review and for involvement of additional persons not 
affiliated with ETS to help plan, write, and review the tests. 

These changes, many of which emphasized "professionalizing" the knowledge 
assessed on the tests, were implemented for the 1964-65 testings and involved the 
first major revisions of the exams since the Educational Testing Service took over 
the project more than a decade before. Eliminated at last was the nonverbal 
reasoning test which the committee had believed had "no particular relevance" for 
testing teachers and could not "be considered a test purported to measure academic 
preparation."''^ 

A new publication for teacher examination users, Prospectus for School and 
College Officials, was prepared to "aid school and college officials ... in making 
judgments regarding the appropriateness of the National Teacher Examinations 
program for their particular measurement needs and to assist them in planning to 
use the results of these examinations effectively."^ While noting that the question 
of "what knowledge is of most worth to prospective teachers" was considered in 
exam development, the booklet stressed that the program provided "objective exams 
of measurable knowledges and abilities which [were] commonly considered basic to 
effective classroom teaching and which typically [constituted] major elements in 
current programs of teacher education."** This concentration on teacher preparation 
programs as a source of the knowledge tested was emphasized in another new 
publication, the Technical Handbook, which appeared in 1965. It announced that "the 
chief purpose of the NTE is to provide an independent evaluation of the academic 
preparation of teacher education students."" 

The new battery was organized as a set of three general education tests and 
three professional educational tests. The titles and contributions to the new 
weighted common examination total were as follows: 

General Education                                        61 percent 

   1. Written English Expression                         11% 
   2. Social Studies, Literature, and the Fine Arts      25% 
   3. Science and Mathematics                            25% 

Professional Education                                   39 percent 

   4. Societal Foundations of Education                  13% 
   5. Psychological Foundations of Education             13% 
   6. Teaching Principles and Practices                  13% 

Two studies conducted by ETS staff in the early 1960's further reinforced the 
focus on the tests' relationship to teacher education curricula: Barbara Pitcher's 
study of concurrent test validity^ and Betty Humphry's survey of professional 
course offerings.^ Pitcher, an employee of ETS's Statistical Analysis Division, 
analyzed test score and grade point data of college seniors who graduated in 1959, 
1960, or 1961 from eleven teacher preparatory institutions. Correlations between 
cumulative grade point averages and weighted common examination total scores 
ranged from .38 to .74 with a weighted average of .57. She concluded that this 
represented a reasonably high relation between test scores and college grades. 
Although published only as an internal statistical report, Pitcher's research was the 
first NTE validation study undertaken by ETS personnel and for almost two decades 
was cited to document the exam's concurrent validity.** 
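
For readers unfamiliar with how a "weighted average" of correlations is formed, the 
following is a minimal sketch in Python. It assumes that each institution's 
correlation is weighted by the number of its examinees; the account above does not 
say which weighting Pitcher used, and the institutions, correlations, and sample 
sizes shown are hypothetical and are not taken from her report. 

```python
from typing import Sequence


def weighted_mean_correlation(correlations: Sequence[float],
                              sample_sizes: Sequence[int]) -> float:
    """Average per-institution correlations, weighting each by its sample size.

    Assumption: weights are the numbers of examinees per institution.
    The source does not state the weighting scheme actually used.
    """
    if not correlations or len(correlations) != len(sample_sizes):
        raise ValueError("need one sample size per correlation")
    total = sum(sample_sizes)
    if total <= 0:
        raise ValueError("sample sizes must sum to a positive number")
    return sum(r * n for r, n in zip(correlations, sample_sizes)) / total


# Hypothetical data for three institutions (illustration only).
institution_r = [0.38, 0.74, 0.60]
institution_n = [50, 100, 150]

pooled = weighted_mean_correlation(institution_r, institution_n)
print(round(pooled, 2))
```

Under this assumption, institutions with more examinees pull the summary figure 
toward their own correlation, so a weighted average can differ noticeably from the 
simple midpoint of the reported range. 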

Head of the Education Section, Test Development Division, and in charge of 
preparing test specifications for the newly revised exams, Humphry surveyed 
professional education requirements in some 250 colleges and universities in 
1961-62. Finding considerable overlap in course requirements and materials used in 
institutions approved by the National Council for Accreditation of Teacher 
Education, she concluded that "there is perhaps more agreement concerning the basic 
content taught than might seem readily apparent."^ Though not mentioned as often 
as Pitcher's work, Humphry's survey was also cited in program publications as 
partial documentation of the exam's content validity. 

Responding to External Pressures: The NTE in the 1970's 

Even before the recommendations of the 1961 review committee had been 
implemented with the restructuring of the 1964-65 common exam, there was 
growing impetus for further action. Rapid growth in test adoption by state and local 
school systems in the recently desegregated South^ and the denunciation of the 
examinations by the National Education Association focused attention on test 
validity and use. The yearly volume of candidates had more than doubled since the 
beginning of the decade, growing from the 37,000 tested in 1959-60'^ to almost 
73,000 in 1963-64.^ Much of this growth occurred in the South.^^ In 1963-64, 
eighty-one percent of those registering to take the exam at a nationwide 
administration resided in the South Atlantic or South Central regions of the 
country.^^ By 1968, the exams were required of all candidates in South Carolina, 
North Carolina, Texas, and West Virginia and of applicants of the "grants-in-aid" 
program in Georgia. Additionally, they were often required locally in the District of 
Columbia, Maryland, Virginia, Georgia, Arkansas, Louisiana, and Oklahoma. For 
many southern community and state school systems, the decade of the 1960's was a 
period of considerable turmoil and change, much of which was in response to 
court-ordered desegregation. In a number of these school systems, a related 
change was an increased reliance upon the National Teacher Examinations. 

Over time, test use became subject to greater and greater critical attention 
both within and outside of the Educational Testing Service. In 1966, the National 
Education Association resolved "that the use of examinations such as the National 
Teachers Examination [was] not a desirable method of evaluating teachers in 
service."103 By 1970, it had strengthened its position against the exams and resolved 
"that examinations such as the National Teacher Examinations must not be used as a 
condition of employment or a method of evaluating educators in service for purposes 
such as salary, tenure, retention, or promotion." In the early 1970's, the National 
Education Association joined the U.S. Justice Department in several court 
challenges105 in which "black educators in the deep South contended that [the 
exam's] use had a racially discriminatory effect on minority employment in the 
public schools."106 

The Educational Testing Service and its advisory groups on teacher 
examinations responded to the concerns and criticisms in several ways. Formal 
guidelines for proper use were developed throughout the 1960's and were 
distributed to test users in 1971. Existing tests were carefully scrutinized, both 
experimentally107 and with the review and revision of the test specifications. In 
1969, a panel of minority group educators was invited to review the exams and 
make suggestions. The following year the tests were modified in response to the 
panel's suggestions, most of which dealt with the content of specific sections. 
Beginning with the 1970-71 administrations, the test structure was as follows: 



General Education                                        61 percent 

   1. Written English Expression                         11% 
   2. Social Studies, Literature, and the Fine Arts      25% 
   3. Science and Mathematics                            25% 

4. Professional Education                                39 percent 

The Impact of Judicial Interpretations 

About this time, legal challenges to the use of other employment and licensing 
tests focused new attention on issues of validity and validation. In March 1971, in 
Griggs v. Duke Power Company, the Supreme Court reinforced policies established 
by the U.S. Equal Employment Opportunity Commission the previous year. In the first 
of several landmark cases, the high court ruled that employment tests with a 
disproportionate exclusionary impact on groups protected by the Civil Rights Act of 
1964 must be "shown to be related to job performance." 

Several months after the decision, James Deneen, then ETS's director of 
teacher examinations, issued a statement on the ruling's impact on teacher testing. 
He argued that the National Teacher Examinations were "job-related in so far as 
they measure knowledge that is needed and applied in teaching. The tests' 
specifications and questions are prepared by specialists who teach the subjects 
examined at the college and university level and by school district teachers, 
supervisors, and administrators. The factors and items found in the NTE are based 
on teacher training programs. Thus the tests possess content validity, which is 
basic to any achievement test."'<^ 

Deneen wrote of the content review by black educators and of the plans to add 
"more items which reflect the contributions of minority groups." In argumentation 
very similar to that recently used by ETS president Gregory Anrig110 and others who 
defend the use of the teacher tests despite their documented negative impact on 
minority teachers, Deneen wrote: "Most black teacher trainees who take the NTE are 
products of segregated colleges, segregated elementary and high schools, and 
segregated neighborhoods. They are largely drawn from a population which has 
possessed little economic, social, or political power to change its educational 
environment. It is obvious that, regardless of their race, persons with such a 
background will generally score lower on an educational achievement test than their 
more privileged colleagues. It seems equally obvious that the appropriate response 
to this fact is not to depreciate the importance of knowledge for teachers, but to 
make that knowledge available to all regardless of race or socioeconomic status."^ ^ ' 
Stating that the "Court's decision [pointed] up the urgency of developing more and 
better criteria for measuring teaching," Deneen also described some of the 
validation work then underway at ETS. 

Over the next few years, considerable internal attention was paid to issues of 
validity and validation. Previous research was re-evaluated^ and new procedures 
were considered. Guidelines were provided so that users could conduct their own 
validation studies,^ and self-reported grade point data were gathered during 
testings to further explore concurrent validity.' The most ambitious and most 
decisive validation study conducted by ETS in the 1970's was that undertaken for 
the State of South Carolina. It was this study which slowed the tide of court cases 
filed against the use of the National Teacher Examinations and established current 
NTE validation procedures. 

In 1975, the United States Department of Justice, the National Education 
Association, and groups of South Carolina's teachers charged that the use of the 
National Teacher Examinations in South Carolina for teacher certification and as a 
factor in determining salary violated the equal protection clause of the Fourteenth 
Amendment and Title VII of the Civil Rights Act of 1964. In January 1978, the 
United States Supreme Court refused to accept the case for full briefing and oral 
argument''^ and summarily affirmed the 1977 decision of the Federal District 
Court' '^ which stated: "The State has the right to adopt academic requirements and 
to use written tests designed and validated to disclose the minimum amount of 
knowledge necessary to effective teaching.""'' "There is ample evidence in the 
record of the content validity of the NTE. The NTE have been demonstrated to 
provide a useful measure of the extent to which prospective teachers have mastered 
the content of their teacher training programs." ' ^'^ 

A good deal of the courts' faith in the content validity of the tests was based 
on the study conducted by the Educational Testing Service for the South Carolina 
Department of Education."' About 450 faculty members from some twenty-five 
teacher training institutions in South Carolina examined the test items to 
determine if they fairly sampled the knowledge which the teacher training 
institutions sought to impart. Content review panels judged "whether or not the 
content of each question . . . [was] covered by the teacher education program" and 
assessed "the relation between the description of test content . . . and the curriculum 
in terms of omission or overemphasis."'^ Knowledge estimation panels provided 
"estimates of the percentages of minimally knowledgeable candidates who would be 
expected to know the answers to individual test questions."'^' Thus, faculty 
members' judgments as to the minimum amount of knowledge needed to complete a 
South Carolina teacher education program were used to calculate cutoff scores for 
the common exam and each of the area exams. 
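
The arithmetic behind such a judgment-based cutoff can be sketched as follows. This 
is an illustration only: it assumes that each question's estimates are averaged 
across judges and that the averages are then summed to give an expected raw score 
for a minimally knowledgeable candidate. The study as described above does not 
spell out its aggregation rule, and the panel data and function name below are 
hypothetical. 

```python
from statistics import mean
from typing import Sequence


def estimated_cutoff(judge_estimates: Sequence[Sequence[float]]) -> float:
    """Return an expected raw score for a minimally knowledgeable candidate.

    judge_estimates holds one row per judge; each row lists, per question,
    the estimated percentage (0-100) of minimally knowledgeable candidates
    expected to answer that question correctly.

    Assumption: estimates are averaged over judges, then summed over questions.
    """
    if not judge_estimates:
        raise ValueError("at least one judge is required")
    n_items = len(judge_estimates[0])
    if any(len(row) != n_items for row in judge_estimates):
        raise ValueError("every judge must rate every question")
    # Mean estimate per question across all judges.
    item_means = [mean(row[i] for row in judge_estimates) for i in range(n_items)]
    # Convert percentages to proportions and sum to get an expected raw score.
    return sum(p / 100 for p in item_means)


# Hypothetical panel: three judges rating a four-question test.
panel = [
    [80, 60, 50, 90],
    [70, 60, 40, 90],
    [90, 60, 60, 90],
]

cutoff = estimated_cutoff(panel)
print(round(cutoff, 1))
```

Whatever the exact rule, a cutoff built this way rests entirely on the faculty 
judges' estimates, which is why the procedure ties the qualifying score so closely 
to the expectations of the training institutions themselves. 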

In the next few years, the NTE were validated, using the South Carolina 
model, for certification in California, Louisiana, and North Carolina and for 
licensure by the American Speech-Language-Hearing Association.122 In each case, 
test items were compared with curricula of the teacher training institutions and 
the conclusion reached that the tests were "valid" because they were representative 
of the content taught in those programs. 

The Policy Council and Its New Core Battery: The NTE's in the 1980's 

In the current decade, the Educational Testing Service and those responsible for 
the NTE program have tried to avoid the unflattering controversy and costly legal 
entanglement of the past while profiting from a market created by an intense public 
demand for teacher testing. In 1979, ETS selected a twelve person external board, 
the "National Teacher Examinations Policy Council," to govern and direct NTE 
program policies involving the development, administration, and use of the 
exams.123 Members, who beginning in 1982 also included classroom teachers, were 
drawn from states and school districts that used the tests and from user and 
non-user institutions of higher education and were appointed in order to "make the 
program more responsive to user requirements."124 Created to insulate ETS from 
controversial policy and legal decisions, the Policy Council was given "all policy 
making responsibility" for the NTE program.125 ETS personnel, however, continue to 
take responsibility, and credit, for popular actions such as the decision to 
disallow NTE sales for the testing of experienced teachers in Arkansas.126 

Reiterating that "the basic purpose of the tests" was "to provide a measure of 
academic preparation for beginning teachers,"127 the Policy Council introduced a 
major NTE revision in the fall of 1982. Criticisms of the previous tests and the 
"screening, counseling, guidance, and feedback needs of teacher education 
institutions" were taken into consideration. Consisting of three distinct sections to 
be administered at the same time or in separate two-hour blocks, the new Core 
Battery samples content similar to that covered in the earliest National Teacher 
Examinations: basic communicative skills, cultural background, and professional 
information. This time, however, the three tests are scored and reported 
separately and are not combined into a single score.128 Test names, sections, and 
components129 are as follows: 

Test of Communication Skills 

   1. Listening                   40 multiple choice questions 
   2. Reading                     30 multiple choice questions 
   3. Writing                     45 multiple choice questions 
   4. Writing                     one essay question 

Test of General Knowledge 

   1. Social Studies              30 multiple choice questions 
   2. Mathematics                 25 multiple choice questions 
   3. Literature and Fine Arts    35 multiple choice questions 
   4. Science                     30 multiple choice questions 

Test of Professional Knowledge 

   4 sections of 35 multiple choice questions each, only 3 of which are scored. 

Careful to operate within the bounds of past court decisions, ETS personnel 
initially stressed that because "the Core Battery was sufficiently different from 
the Common Examinations, the qualifying scores established for the Commons [could 
not] be used with the Core Battery Tests." Thus score users were advised that it 
would be necessary that they conduct new validity studies "to examine the 
relationship of the new test content to what is taught . . ."131 

Validation of "what is taught" is, however, no longer legally sufficient. 
Responding to previous legal challenges and to the 1978 adoption of the Equal 
Employment Opportunity Commission's Uniform Guidelines on Employee Selection 
Procedures,132 the latest NTE use guidelines require that the "NTE Program tests be 
validated for the specific purposes for which they are being used."133 They point out 
that "in addition, federal and other civil rights laws, such as Title VI and Title VII 
of the Civil Rights Act of 1964, may also require validation if the use being made of 
the tests is shown to disproportionately disadvantage members of ethnic, racial, 
religious, or gender subgroups."134 Users are referred to appropriate "professional 
and legal standards" and are advised that "in some cases these standards require the 
use of job analyses or other similar techniques."135 

Earlier this year, ETS president Gregory Anrig announced that a recent "job 
analysis project" which involved the participation of some 3000 classroom teachers 
will soon be published and will be used in the future "to assist in developing and 
validating the NTE for state certification."^** To date, however, validation of the 
Core Battery has involved judging procedures similar to those followed in the South 
Carolina study. In addition to considering the tests' similarity to teacher education 
curricula, judges, who may be teacher education personnel, or practicing 
teachers,138 or both,140 are now asked to compare test items to that knowledge 
required by beginning or minimally qualified teachers. 

Conclusion 

Three major findings emerge from historical consideration of the validity and 
validation of the National Teacher Examinations. These relate to (1) the continuity 
of test content and justification over the fifty year period of the program's 
existence, (2) the primacy of reliance upon logical or content validity, and (3) the 
paradoxical relationship of the tests to teacher education curricula. 

First, since their inception the National Teacher Examinations have measured 
three categories of teacher knowledge: basic intellectual and communicative 
skills, general cultural and contemporary background, and pedagogical and 
professional information. Although the exam evolved from the comprehensive 
multi-sectioned battery administered in 1940 to the more narrowly focused tests 
of the 1960's and 1970's and then to the three part core of the 1980's, clearly, that 
first test set the pattern for those which followed it. There has been a strong 
tendency to maintain the status quo and to continue relying upon previous models of 
the exams even when those models were "inherited" from prior agencies or test 
developers. No additional or innovative sections were adopted until 1962. 

When changes did occur, they were undertaken for either financial reasons or as 
responses to specific criticisms. Changes made in the 1940's and 1950's were 
undertaken to save construction and administration costs and those made in the 
1960's reflected the criticisms of the NCTEPS and minority group review 
committees. Certainly, the restructuring of the exam in 1982 responded both to 
previous criticisms and to a perceived new test market. Throughout this evolution, 
however, the official justification for test use has continued to focus upon the 
perceived incompetence of many teachers and the assumed inadequacy of their 
training. 

Second, there has been a strong and persistent tendency to justify the exams in 
terms of their logical or practical validity. Statistical validation was 
de-emphasized and even ridiculed by many of those in charge of the program until 
necessitated by the courts. Again and again, test validation was justified using 
arguments and strategies which were no longer appropriate or true. Reliance upon 
explanations of the tests' careful construction persisted through periods in which 
actual construction was inadequate. Justification based upon the inadequacy of 
existing teacher training was used even when those being tested had long been away 
from the training institutions. 

And finally, despite the persistent assumption that the tests are needed 
because graduates of many teacher education programs are inadequately prepared, 
the source of test content and validity has been and continues to be focused 
primarily upon the perceived curricula of those programs. 

The original tests assessed knowledge selected by administrators as 
representative of what all "good teachers should know." Program officials stressed 
that the exams did not measure the totality of teaching ability and therefore should 
not be judged by their correlation with "available criteria of teaching success." 
Despite this, a number of early studies attempted to demonstrate the exam's 
predictive validity by comparing test scores with supervisors' or principals' ratings. 
Later, investigations were broadened to include comparisons with concurrent 
measures of achievement in college or graduate school. Program personnel have 
tended to attribute the low to moderate correlations yielded by these studies to 
their theoretical or technical inadequacies. 

By the mid 1960's, the examinations were said to appraise "basic professional 
preparation and general academic attainment." Content validity, justified 
primarily in terms of the qualifications of those nationally selected and recognized 
experts who assisted in test development, was emphasized in the program's 
publications. 

Beginning in the early 1970's, a number of lawsuits charged that the tests 
were being used in some states and communities to discriminate against minority 
teachers and teacher candidates. Teachers' unions and other critics claimed that 
the tests were inappropriate because they were not "validated" against job-related 
criteria. In 1978, the U.S. Supreme Court ruled in favor of the exams' use and thus 
indirectly in favor of their content validity. Although practicing teachers are now 
asked to judge the relevance of the items, current validation procedures tend to 
closely mirror those used in the South Carolina study and focus upon the similarity 
of test content to the curricula of training institutions. This is the case despite the 
prevalent assumption that testing is now justified, as it was fifty years ago, on the 
basis of these institutions graduating inadequately trained teachers. 